Abstract

For several years, MPI has been the de facto standard 
for writing parallel applications. One of the most popular 
MPI implementations is MPICH. Its successor, MPICH2, 
features a completely new design that provides more per- 
formance and flexibility. To ensure portability, it has a hi- 
erarchical structure based on which porting can be done at 
different levels. 

In this paper, we present our experiences designing and 
implementing MPICH2 over InfiniBand. Because of its high 
performance and open standard, InfiniBand is gaining pop- 
ularity in the area of high-performance computing. Our 
study focuses on optimizing the performance of MPI-1 func- 
tions in MPICH2. One of our objectives is to exploit Remote 
Direct Memory Access (RDMA) in InfiniBand to achieve
high performance. We have based our design on the RDMA 
Channel interface provided by MPICH2, which encapsu- 
lates architecture-dependent communication functionalities 
into a very small set of functions. 

Starting with a basic design, we apply different optimiza- 
tions and also propose a zero-copy-based design. We char- 
acterize the impact of our optimizations and designs using 
microbenchmarks. We have also performed an application- 
level evaluation using the NAS Parallel Benchmarks. Our 
optimized MPICH2 implementation achieves 7.6 µs latency
and 857 MB/s bandwidth, which are close to the raw perfor- 
mance of the underlying InfiniBand layer. Our study shows 
that the RDMA Channel interface in MPICH2 provides a 
simple, yet powerful, abstraction that enables implemen- 
tations with high performance by exploiting RDMA oper- 
ations in InfiniBand. To the best of our knowledge, this 
is the first high-performance design and implementation of 
MPICH2 on InfiniBand using RDMA support. 



1 Introduction 

During the past ten years, the research and industry com- 
munities have proposed and implemented user-level com- 
munication systems to address some of the problems asso- 
ciated with traditional networking protocols. The Virtual 
Interface Architecture (VIA) [6] was proposed to standardize
these efforts. More recently, the InfiniBand Architecture [8]
has been introduced, which combines storage I/O
with interprocess communication. 

In addition to send and receive operations, Infini- 
Band architecture supports Remote Direct Memory Access 
(RDMA). RDMA operations enable direct access to the ad- 
dress space of a remote process. These operations introduce 
new opportunities and challenges in designing communica- 
tion protocols. 

In the area of high-performance computing, MPI [23]
has been the de facto standard for writing parallel applica- 
tions. After the original MPI standard (MPI-1), an enhanced 
standard (MPI-2) [17] was introduced, which includes features
such as dynamic process management, one-sided communication,
and I/O. MPICH [7] is one of the most popular
MPI-1 implementations. Recently, work has begun to create
MPICH2 [1], which aims to support both MPI-1 and
MPI-2 standards. It features a completely new design that 
provides better performance and flexibility than its prede- 
cessor. MPICH2 is also very portable and provides mech- 
anisms that make it easy to retarget MPICH2 to new com- 
munication architectures such as InfiniBand. 

In this paper, we present our experiences in designing 
and implementing MPICH2 over InfiniBand using RDMA
operations. Although MPICH2 supports both MPI-1 and 
MPI-2, our study focuses on optimizing the performance of 
MPI-1 functions. We have based our design on the RDMA 
Channel interface provided by MPICH2, which encapsu- 
lates architecture-dependent communication functionalities 
in a small set of functions. Despite its simple interface, 
we have shown that the RDMA Channel does not pre- 
vent one from achieving high performance. In our testbed, 
our MPICH2 implementation achieves 7.6 µs latency and
857 MB/s peak bandwidth, which are quite close to the raw 
performance of the InfiniBand platform. We have also evaluated
our designs using the NAS Parallel Benchmarks [25].
Overall, we have demonstrated that the RDMA Channel in- 
terface is a simple, yet powerful, abstraction that makes it 
possible to design high-performance MPICH2 implementa- 
tions with less development effort. 

In our design, communication between processes is done 
exclusively using RDMA operations. Our design starts with 
an emulation of a shared-memory-based implementation. 
Then we introduce various optimization techniques to im- 
prove its performance. To evaluate the impact of each opti- 
mization, we use latency and bandwidth microbenchmarks. 
We also propose a zero-copy design for large messages. 
Our results show that with piggybacking and zero-copy opti- 
mizations for large messages, our design achieves very good 
performance. 

The remainder of the paper is organized as follows. In 
Section 2, we provide an introduction to InfiniBand and its 
RDMA operations. In Section 3, we present an overview 
of MPICH2, its implementation structure, and the RDMA 
Channel interface. In Sections 4 and 5, we describe our de- 
signs and implementations. In Section 6, we compare our 
RDMA Channel-based design with another design based on 
a more complicated interface called CH3. In Section 7, we 
present an application level performance evaluation. In Sec- 
tion 8, we describe related work. In Section 9, we draw 
conclusions and briefly mention some future research di- 
rections. 



2 InfiniBand Overview 

The InfiniBand Architecture (IBA) [8] defines a switched
network fabric for interconnecting processing nodes and I/O 
nodes. It provides a communication and management in- 
frastructure for interprocessor communication and I/O. In 
an InfiniBand network, processing nodes and I/O nodes 
are connected to the fabric by channel adapters (CAs). 
Channel adapters usually have programmable DMA en- 
gines with protection features. There are two kinds of chan- 
nel adapters: host channel adapter (HCA) and target chan- 
nel adapter (TCA). HCAs sit on processing nodes. 

The InfiniBand communication stack consists of differ- 
ent layers. The interface presented by channel adapters to 
consumers belongs to the transport layer. A queue-based 
model is used in this interface. A queue pair in InfiniBand 
Architecture consists of two queues: a send queue and a re- 
ceive queue. The send queue holds instructions to transmit 
data, and the receive queue holds instructions that describe 
where received data is to be placed. Communication opera- 
tions are described in work queue requests (WQRs), or de- 
scriptors, and are submitted to the work queue. The comple- 
tion of WQRs is reported through completion queues (CQs). 
Once a work queue element is finished, a completion queue 
entry is placed in the associated completion queue. Appli- 
cations can check the completion queue to see whether any 
work queue request has been finished. InfiniBand also sup- 
ports different classes of transport service. In this paper, we 
focus on the reliable connection (RC) service. 
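
The queue-based model described above can be pictured with a small toy simulation. The sketch below is not the InfiniBand programming interface; the names QueuePair, CompletionQueue, post_send, post_recv, and process are hypothetical and chosen only for illustration, and the process function stands in for the work that a channel adapter would perform in hardware.

```python
from collections import deque


class CompletionQueue:
    """Holds completion entries for finished work requests."""

    def __init__(self):
        self._entries = deque()

    def push(self, wr_id):
        self._entries.append(wr_id)

    def poll(self):
        """Return the id of one finished request, or None if nothing finished."""
        return self._entries.popleft() if self._entries else None


class QueuePair:
    """A send queue and a receive queue that report to one completion queue."""

    def __init__(self, cq):
        self.send_queue = deque()
        self.recv_queue = deque()
        self.cq = cq

    def post_send(self, wr_id, data):
        self.send_queue.append((wr_id, data))

    def post_recv(self, wr_id, buffer):
        self.recv_queue.append((wr_id, buffer))


def process(sender, receiver):
    """Hypothetical stand-in for the adapter: move one message between queue pairs."""
    if not sender.send_queue or not receiver.recv_queue:
        return False
    send_id, data = sender.send_queue.popleft()
    recv_id, buffer = receiver.recv_queue.popleft()
    buffer[:len(data)] = data      # place data where the receive descriptor says
    sender.cq.push(send_id)        # completion reported at the sender
    receiver.cq.push(recv_id)      # completion reported at the receiver
    return True
```

In this model an application learns that a request has finished only by polling its completion queue, which mirrors the description in the text.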

2.1 RDMA Operations in InfiniBand Architec- 
ture 

InfiniBand Architecture supports both channel and mem- 
ory semantics. In channel semantics, send/receive opera- 
tions are used for communication. To receive a message, the 
programmer posts a receive descriptor that describes where 
the message should be put at the receiver side. At the sender 
side, the programmer initiates the send operation by posting 
a send descriptor. 

In memory semantics, InfiniBand supports remote di- 
rect memory access (RDMA) operations, including RDMA 
write and RDMA read. RDMA operations are one sided and 
do not incur software overhead at the remote side. In these 
operations, the sender (initiator) starts RDMA by posting 
RDMA descriptors. A descriptor contains both the local 
data source addresses (multiple data segments can be spec- 
ified at the source) and the remote data destination address. 
At the sender side, the completion of an RDMA operation 
can be reported through CQs. The operation is transparent 
to the software layer at the receiver side. 

Since RDMA operations enable a process to access the 
address space of another process directly, they have raised 
some security concerns. In InfiniBand architecture, a key-based
mechanism is used. A memory buffer must first be
registered before it can be used for communication. Among 
other things, the registration generates a remote key. This 
remote key must be presented when the sender initiates an 
RDMA operation to access the buffer. 
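
The key-based protection can be illustrated with a toy model. The sketch below assumes a hypothetical RegisteredMemory class in which registration returns a random 32-bit key and every remote write must present that key; it models only the idea and does not reproduce how a real HCA generates or checks keys.

```python
import secrets


class RegisteredMemory:
    """Toy model of registered buffers that are guarded by remote keys."""

    def __init__(self):
        self._regions = {}  # remote key -> registered bytearray

    def register(self, buffer):
        """Register a buffer and return the remote key that grants access to it."""
        rkey = secrets.randbits(32)
        while rkey in self._regions:
            rkey = secrets.randbits(32)
        self._regions[rkey] = buffer
        return rkey

    def rdma_write(self, rkey, offset, data):
        """Write data into a registered buffer; reject unknown keys and bad ranges."""
        region = self._regions.get(rkey)
        if region is None:
            raise PermissionError("unknown remote key")
        if offset < 0 or offset + len(data) > len(region):
            raise ValueError("access outside the registered region")
        region[offset:offset + len(data)] = data
```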

Compared with send/receive operations, RDMA oper- 
ations have several advantages. First, RDMA operations 
themselves are generally faster than send/receive operations 
because they are simpler at the hardware level. Second, they 
do not involve managing and posting descriptors at the re- 
ceiver side, which would incur additional overheads and re- 
duce the communication performance. 

3 MPICH2 Overview 

MPICH [7] is developed at Argonne National Laboratory.
It is one of the most popular MPI implementations.
The original MPICH provides support for the MPI-1 standard.
As a successor of MPICH, MPICH2 [1] aims to support
not only the MPI-1 standard but also functionalities
such as dynamic process management, one-sided commu- 
nication, and MPI I/O, which are specified in the MPI-2 
standard. However, MPICH2 is not merely MPICH with 
MPI-2 extensions. It is based on a completely new design, 
aiming to provide more performance, flexibility, and porta- 
bility than the original MPICH. One of the notable features 
in the implementation of MPICH2 is that it can take advan- 
tage of RDMA operations if they are provided by the under- 
lying interconnect. These operations can be used not only 
to support MPI-2 one-sided communication but also to implement
normal MPI-1 communication. Although MPICH2
is still under development, beta versions are already avail- 
able for developers. In the current version, all MPI-1 func- 
tions have been implemented. MPI-2 functions are not com- 
pletely supported yet. In this paper, we mainly focus on the 
MPI-1 part of MPICH2. 

3.1 MPICH2 Implementation Structure 

One of the objectives in MPICH2 design is portability. 
To facilitate porting MPICH2 from one platform to another, 
MPICH2 uses ADI3 (the third generation of the Abstract
Device Interface) to provide a portability layer. ADI3 is a
full-featured abstract device interface and has many func- 
tions, so it is not a trivial task to implement all of them. To 
reduce the porting effort, MPICH2 introduces the CH3 in- 
terface. CH3 is a layer that implements the ADI3 functions 
and provides an interface consisting of only a dozen func- 
tions. A "channel" implements the CH3 interface. Channels 
exist for different communication architectures such as TCP 
sockets and SHMEM. Because only a dozen functions are
associated with each channel interface, it is easier to implement
a channel than the ADI3 device.

To take advantage of architectures with globally shared 
memory or RDMA capabilities and to further reduce the 
porting overhead, MPICH2 introduces the RDMA Channel 
which implements the CH3 interface. The RDMA Channel 
interface only contains five functions. We will discuss the 
details of the RDMA Channel interface in the next subsec- 
tion. 

The hierarchical structure of MPICH2, as shown in Figure 1,
gives much flexibility to implementors. The three
interfaces (ADI3, CH3, and the RDMA Channel interface)
provide different trade-offs between communication perfor- 
mance and ease of porting. 

Figure 1. MPICH2 Implementation Structure



3.2 MPICH2 RDMA Channel Interface 

The MPICH2 RDMA Channel interface is designed 
specifically for architectures with globally shared memory 
or RDMA capabilities. It contains five functions, among 
which only two are central to communication. (Other func- 
tions deal with process management, initialization, and fi- 
nalization.) These two functions are called put (write) and 
get (read). 

Both put and get functions accept a connection structure 
and a list of buffers as parameters. They return to the caller 
the number of bytes that have been successfully put or got- 
ten. If the number of bytes completed is less than the total 
length of buffers, the caller will retry the same get or put 
operation later. 

Figure 2 illustrates the semantics of put and get. Logically,
a pipe is shared between the sender and the receiver.
The put operation writes to the pipe, and the get operation 
reads from it. The data in the pipe is consumed in FIFO order.
Both operations are nonblocking in the sense that they
return immediately with the number of bytes completed, in- 
stead of waiting for the entire operation to finish. We note 
that put and get are different from RDMA write and RDMA 
read in InfiniBand. While RDMA operations in InfiniBand 
are one sided, put and get in the RDMA Channel interface 
are essentially two-sided operations. 

Put and get operations can be implemented on archi- 
tectures with globally shared memory in a straightforward 
manner. Figure 3 shows one example. In this implementation,
a shared buffer (organized logically as a ring) is placed
in shared memory, together with a head pointer and a tail 
pointer. The put operation copies the user buffer to the 
shared buffer and adjusts the head pointer. The get oper- 
ation involves reading from the shared buffer and adjusting 
the tail pointer. In the case of buffer overflow or underflow 
(detected by comparing head and tail pointers), the opera- 
tions return immediately, and the caller will retry them. 
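
The shared-memory scheme above can be sketched in a few lines of Python. This is a minimal model under our own assumptions, not the MPICH2 source: the class name SharedRing and the helper send_all are hypothetical, head and tail are kept as running byte counts, and a single process plays both sender and receiver.

```python
class SharedRing:
    """Ring buffer with head and tail counters, as in the shared-memory scheme."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.size = size
        self.head = 0  # total bytes ever written (advanced by put)
        self.tail = 0  # total bytes ever read (advanced by get)

    def put(self, data):
        """Copy as much of data as fits; return the number of bytes written."""
        free = self.size - (self.head - self.tail)
        n = min(free, len(data))
        for i in range(n):
            self.buf[(self.head + i) % self.size] = data[i]
        self.head += n
        return n

    def get(self, out):
        """Fill out with available bytes; return the number of bytes read."""
        avail = self.head - self.tail
        n = min(avail, len(out))
        for i in range(n):
            out[i] = self.buf[(self.tail + i) % self.size]
        self.tail += n
        return n


def send_all(ring, data, drain):
    """Caller-side retry loop: call put until every byte has been accepted."""
    sent = 0
    while sent < len(data):
        n = ring.put(data[sent:])
        sent += n
        if n == 0:
            drain()  # let the receiver make progress before retrying
    return sent
```

A return value smaller than the requested length corresponds to the overflow or underflow case in the text, and the retry loop in send_all plays the role of the caller.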

[Diagram: Sender, Receiver, Head Pointer, FIFO, Get; legend: Put/Get Operations, Buffers, Buffer Pointers]

Figure 2. Put and Get Operations

Working at the RDMA Channel interface level is better 
than writing a new CH3 or ADI3 implementation for many
reasons: 

1. Improvements done at this level can affect all shared- 
memory-like transports such as globally shared mem- 
ory, RDMA over IP, Quadrics, and Myrinet. 

2. Other protocols on InfiniBand need efficient process- 
ing, including one-sided communication in MPI-2, 
DSM systems, and parallel file systems. The RDMA 
Channel interface can potentially be used also for 
them. 

3. Designing proper interfaces to similar systems im- 
proves performance and portability in general. 

The OSU and ANL teams are also currently collaborating
to design an improved interface that
can benefit communication systems in general.




[Diagram: shared ring buffer with Data and Empty regions]

Figure 3. Put and Get Implementation with
Globally Shared Memory



4 Designing and Optimizing MPICH2 over 
InfiniBand 

In this section, we present several different designs of 
MPICH2 over InfiniBand based on the RDMA Channel in- 
terface. We first start with a basic design that resembles the 
scheme described in Figure 3. Then we apply various optimization
techniques to improve its performance. In this section,
the designs are evaluated by using microbenchmarks
such as latency and bandwidth. We show that by taking ad- 
vantage of RDMA operations in InfiniBand, we can achieve 
not only low latency for small messages but also high band- 
width for large messages using the RDMA Channel inter- 
face. In Section 5, we present a zero-copy design. 

4.1 Experimental Testbed 

Our experimental testbed consists of a cluster system 
with 8 SuperMicro SUPER P4DL6 nodes. Each node 
has dual Intel Xeon 2.40 GHz processors with a 512K L2 
cache and a 400 MHz front side bus. The machines are
connected by Mellanox InfiniHost MT23108 DualPort 4X
HCAs through an InfiniScale MT43132 Eight 4x
Port InfiniBand Switch. The HCAs use
PCI-X 64-bit 133 MHz interfaces. We used the Red Hat
Linux 7.2 operating system with the 2.4.7 kernel. The compilers
we used were GNU GCC 2.96 and GNU FORTRAN 0.5.26. 

4.2 Basic Design 

In Figure 3, we illustrated how the RDMA Channel interface
can be implemented on shared-memory architectures.
In a cluster connected by InfiniBand, however, there is no
physically shared memory. In our basic design, we use 
RDMA write operations provided by InfiniBand to emulate 
this scheme. 

We put the shared-memory buffer in the receiver's main 
memory. This memory is registered and exported to the 
sender. Therefore, it is accessible to the sender through 
RDMA operations. To avoid the relatively high cost of reg- 
istering user buffers for sending every message, we also use 
a preregistered buffer at the sender that is the same size as 
the shared buffer at the receiver. User data is first copied 
into this buffer and then sent out. Head and tail pointers 
also need to be shared between the sender and the receiver. 
Since they are used frequently at both sides, we use repli- 
cation to prevent polling through the network. For the tail 
pointer, a master copy is kept at the receiver, and a replica 
is kept at the sender. For the head pointer, a master copy is 
kept at the sender, and a replica is kept at the receiver. For 
each direction of every connection, the associated "shared" 
memory buffer, head and tail pointers are registered during 
initialization, and their addresses and remote keys are ex- 
changed. 

At the sender, the put operation is implemented as fol- 
lows: 

1. Use local copies of head and tail pointers to decide how
much empty space is available. 

2. Copy user buffer to the preregistered buffer. 

3. Use RDMA write operation to write the data to the 
buffer at the receiver side. 

4. Adjust the head pointer based on the amount of data 
written. 

5. Use another RDMA write to update the remote copy 
of head pointer. 

6. Return the number of bytes written. 

At the receiver, the get operation is implemented in the 
following way: 

1. Check local copies of head and tail pointers to see 
whether there is new data available. 

2. Copy the data from the shared memory buffer to user 
buffer. 

3. Adjust the tail pointer based on the amount of data that 
has been copied. 

4. Use an RDMA write to update the remote copy of tail 
pointer. 

5. Return the number of bytes successfully read. 



We note that copies of head and tail pointers are not al- 
ways consistent. For example, after a sender adjusts its head 
pointer, it uses RDMA write to update the remote copy at 
the receiver. Therefore, the head pointer at the receiver is 
not up to date until the RDMA write finishes. However, this 
inconsistency does not affect the correctness of the scheme; 
it merely prevents the receiver from reading new data tem- 
porarily. Similarly, inconsistency of tail pointer may pre- 
vent the sender from writing to the shared buffer. But even- 
tually the pointers will become up to date, and the sender or 
the receiver will be able to make progress. 
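
The put and get steps listed above, together with the temporary inconsistency of the replicated pointers, can be modeled as follows. This is a sketch under stated assumptions rather than our implementation: Endpoint and Link are hypothetical names, and the Link class queues each "RDMA write" until deliver is called so that the delay between issuing a write and its arrival becomes visible.

```python
class Endpoint:
    """State held at one side of a connection."""

    def __init__(self, size):
        self.size = size
        self.head = 0   # sender: master copy; receiver: replica
        self.tail = 0   # receiver: master copy; sender: replica
        self.buf = bytearray(size)  # shared buffer (used at the receiver)


class Link:
    """Queues simulated RDMA writes; they take effect only when delivered."""

    def __init__(self):
        self.pending = []

    def rdma_write(self, action):
        self.pending.append(action)

    def deliver(self):
        for action in self.pending:
            action()
        self.pending.clear()


def put(sender, receiver, link, data):
    free = sender.size - (sender.head - sender.tail)   # step 1: local pointers
    n = min(free, len(data))
    if n == 0:
        return 0
    staged = bytes(data[:n])                           # step 2: copy to staging
    start = sender.head

    def write_data():                                  # step 3: RDMA write of data
        for i, b in enumerate(staged):
            receiver.buf[(start + i) % receiver.size] = b

    link.rdma_write(write_data)
    sender.head += n                                   # step 4: adjust head
    new_head = sender.head
    link.rdma_write(lambda: setattr(receiver, "head", new_head))  # step 5
    return n                                           # step 6


def get(sender, receiver, link, out):
    avail = receiver.head - receiver.tail              # step 1: local pointers
    n = min(avail, len(out))
    if n == 0:
        return 0
    for i in range(n):                                 # step 2: copy to user buffer
        out[i] = receiver.buf[(receiver.tail + i) % receiver.size]
    receiver.tail += n                                 # step 3: adjust tail
    new_tail = receiver.tail
    link.rdma_write(lambda: setattr(sender, "tail", new_tail))    # step 4
    return n                                           # step 5
```

Before deliver runs, get returns 0 even though put has succeeded, and a full buffer keeps rejecting put until the tail update arrives. Both sides make progress once the writes land, which is the behavior argued for in the text.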

4.2.1 Performance of the Basic Design 

We use latency and bandwidth tests to evaluate the perfor- 
mance of our basic design. The latency test is conducted in a 
ping-pong fashion, and the results are derived from round- 
trip time. In the bandwidth test, a sender keeps sending 
back-to-back messages to the receiver until it has reached a 
predefined window size W. Then it waits for these messages 
to finish and sends out another W messages. The results are
derived from the total test time and the number of bytes sent. 
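
The two metrics can be written down explicitly. The helpers below are a sketch with hypothetical names; the latency helper assumes the common convention of reporting half of the average round-trip time, which the text does not state explicitly, and the bandwidth helper uses 1 MB = 10^6 bytes.

```python
def one_way_latency_us(round_trip_seconds, iterations):
    """Half of the average round-trip time, in microseconds (assumed convention)."""
    return round_trip_seconds / iterations / 2 * 1e6


def bandwidth_mb_per_s(message_bytes, messages_sent, total_seconds):
    """Total bytes sent divided by elapsed time, with 1 MB = 10**6 bytes."""
    return message_bytes * messages_sent / total_seconds / 1e6
```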

Figures 4 and 5 show the results. Our basic design
achieves a latency of 18.6 µs for small messages and a bandwidth
of 230 MB/s for large messages. (Unless stated otherwise,
the unit MB in this paper is an abbreviation for 10^6
bytes, not 2^20 bytes.) These numbers are much worse than
the raw performance numbers achievable by the underlying
InfiniBand layer (5.9 µs latency and 870 MB/s bandwidth).

[Plot; x-axis: Message Size (Bytes)]

Figure 4. MPI Latency for Basic Design

A careful look at the basic design reveals many ineffi- 
ciencies. For example, a matching pair of send and receive 
operations in MPI require three RDMA write operations to 
take place: one for transfer of data, and two for updating 
head and tail pointers. These not only increase latency and 
host overhead but also generate unnecessary network traffic. 

For large messages, the basic scheme leads to two extra
memory copies. The first one is from user buffer to the
preregistered buffer at the sender side. The second one is
from the shared buffer to user buffer at the receiver side.
These memory copies consume resources such as memory
bandwidth and CPU cycles. To make matters worse, in the
basic design the memory copies and communication operations
are serialized. For example, a sender first copies the
whole message (or part of the message if it cannot fit
in the empty space of the preregistered buffer). Then it
initiates RDMA write to transfer the data. This serialization
of copying and RDMA write greatly reduces the bandwidth
for large messages.

Figure 5. MPI Bandwidth for Basic Design

4.3 Optimization with Piggybacking Pointer Updates

Our first optimization aims to avoid separate head and
tail pointer updates whenever possible. The technique we
used is piggybacking, which combines pointer updates with
data transfer.

At the sender side, we combine data and the new value
of head pointer into a single message. To help the receiver
detect the arrival of the message, we attach the size to the
message and put two flags at the beginning and the end of
the message. The receiver detects arrival of the new message
by polling on the flags. To avoid possible situations
where the buffer content happens to have the same value as
the flag, we divide the shared buffer into fixed-sized chunks.
Each message uses a different chunk. In this way, the situations
can be handled by using two polling flags, or "bottom
fill." Similar techniques have been used in [16, 21].

At the receiver side, instead of using RDMA write to update
the remote tail pointer each time data has been read,
we delay the updates until the free space in the shared
buffer drops below a certain threshold. If messages are sent
from the receiver to the sender, the pointer value is attached
to the message, and no extra message is used to transfer
pointer updates. If no messages are sent from the receiver
to the sender, eventually we will explicitly send the updates
by using an extra message. The sender updates its pointer
after receiving this message. Even in this case, however, the
traffic can be reduced because several consecutive updates
of the tail pointer can be sent by using only one message.

The use of piggybacking and delayed pointer updates
can greatly improve the performance of small messages.
From Figure 6, we see that the latency is reduced from
18.6 µs to 7.4 µs. Figure 7 shows that the optimization also
improves bandwidth for small messages.

Figure 6. MPI Small-Message Latency with
Piggybacking

Figure 7. MPI Small-Message Bandwidth with
Piggybacking 
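
The piggybacking scheme of this section can be sketched as follows. The layout and names here are assumptions made for illustration (FLAG, HEADER, pack_chunk, poll_chunk, and TailUpdater are hypothetical, as is the 9-byte header format); the sketch also assumes that a chunk is zeroed before it is reused, which sidesteps the flag-collision handling described in the text.

```python
import struct

FLAG = 0xA5
HEADER = struct.Struct("<BII")  # head flag, payload size, piggybacked head pointer


def pack_chunk(payload, new_head, chunk_size):
    """Build one fixed-size chunk: header, payload, then a trailing flag byte."""
    if HEADER.size + len(payload) + 1 > chunk_size:
        raise ValueError("payload does not fit in one chunk")
    chunk = bytearray(chunk_size)
    HEADER.pack_into(chunk, 0, FLAG, len(payload), new_head)
    end = HEADER.size + len(payload)
    chunk[HEADER.size:end] = payload
    chunk[end] = FLAG
    return chunk


def poll_chunk(chunk):
    """Return (payload, head) once both flags are visible, else None."""
    flag, size, head = HEADER.unpack_from(chunk, 0)
    end = HEADER.size + size
    if flag != FLAG or end >= len(chunk) or chunk[end] != FLAG:
        return None
    return bytes(chunk[HEADER.size:end]), head


class TailUpdater:
    """Receiver-side policy: report the tail only when free space runs low."""

    def __init__(self, total_chunks, threshold):
        self.total = total_chunks
        self.threshold = threshold
        self.tail = 0            # chunks consumed so far (master copy)
        self.reported_tail = 0   # value the sender currently knows

    def consume(self, head):
        """Consume one chunk; return a tail value to send explicitly, or None."""
        self.tail += 1
        free_seen_by_sender = self.total - (head - self.reported_tail)
        if free_seen_by_sender < self.threshold:
            return self.flush()
        return None

    def flush(self):
        """Report the tail, either piggybacked on data or in an explicit update."""
        self.reported_tail = self.tail
        return self.tail
```

With this layout one RDMA write carries the data, its size, and the new head pointer, so the receiver needs no separate pointer update to detect a message.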

4.4 Optimization with Pipelining of Large Messages



As we have discussed, our basic design suffers from se- 
rialization of memory copies and RDMA writes. A better 
solution is to use pipelining to overlap memory copies with 
RDMA write operations. 

In our piggybacking optimization, we divide the shared-memory
buffer into small chunks. When sending and receiving
large messages, we need to use more than one such
chunk. At the sender side, instead of starting RDMA writes
after copying all the chunks, we initiate the RDMA trans- 
fer immediately after copying each chunk. In this way, the 
RDMA operation can be overlapped with the copying of the 
next chunk. Similarly, at the receiver side we start copying 
from the shared buffer to the user buffer immediately after 
a chunk is received. In this way, the receive RDMA opera- 
tions can be overlapped with the copying. 
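
The benefit of overlapping can be estimated with a simple timing model of the sender side. The functions below are a sketch with hypothetical names; they assume a fixed copy time and a fixed RDMA write time per chunk and ignore contention on the memory bus, so they give an optimistic bound rather than a prediction.

```python
def serialized_time(chunks, copy_time, rdma_time):
    """Copy every chunk first, then transfer every chunk."""
    return chunks * copy_time + chunks * rdma_time


def pipelined_time(chunks, copy_time, rdma_time):
    """Transfer chunk i while chunk i + 1 is being copied (two-stage pipeline)."""
    copy_done = 0.0
    rdma_done = 0.0
    for _ in range(chunks):
        copy_done += copy_time  # copies run back to back
        # A transfer starts when its chunk is copied and the previous transfer ended.
        rdma_done = max(rdma_done, copy_done) + rdma_time
    return rdma_done
```

In this model the pipelined time approaches the cost of the slower stage alone as the number of chunks grows, which is why balanced stages matter.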

Figure 8 compares the bandwidth of the pipelining
scheme with the basic scheme. (Piggybacking is also used 
in the pipelining scheme.) We can see that pipelining 
combined with piggybacking has greatly improved MPI 
bandwidth. The peak bandwidth has been increased from 
230 MB/s to over 500 MB/s. This result is still not satis- 
fying, however, because InfiniBand is able to deliver band- 
width up to 870 MB/s. 

To investigate the performance bottleneck, we have con- 
ducted memory copy tests in our testbed. We have found 
that memory copy bandwidth is less than 800 MB/s for large 
messages. In our MPI bandwidth tests, with RDMA write 
operations and memory copies both using the memory bus, 
the bandwidth achievable at the application level is even 
less. Therefore, the memory bus clearly becomes a perfor- 
mance bottleneck for large messages because of the extra 
memory copies. 

In the pipelining optimization, it is important that we bal- 
ance each stage of the pipeline so that we can get maxi- 
mum throughput. One parameter we can change to balance 
pipeline stages is the chunk size, or how much data we copy 
each time for a large message. Figure 9 shows MPI bandwidth
for different chunk sizes for the pipelining optimization.
We observe that MPI does not give good performance
when the chunk size is either too small (1K Bytes) or too
large (32K Bytes). MPI performs comparably for chunk 
sizes of 2K to 16K Bytes. In all remaining tests, we have 
chosen a chunk size of 16K Bytes. 

Figure 9. MPI Bandwidth with Pipelining (Different
Chunk Sizes)

5 Zero-Copy Design



As we have discussed in the preceding section, it is de- 
sirable to avoid memory copies for large messages. In this 
section, we describe a zero-copy design for large messages 
based on the RDMA Channel interface. 

In our new design, small messages are still transferred
by using RDMA write, similar to the piggybacking scheme. 
For large messages, however, RDMA read, instead of 
RDMA write, is used for data transfer. The basic idea of 
our zero-copy design is to let the receiver "pull" the data 
directly from the sender using RDMA read. 

For each connection, shared buffers are still used for 
transferring small messages. However, the data for large 
messages is not transferred through the shared buffer. At 
the sender, when the put function is called, it checks the 
user buffer and decides whether to use zero-copy or not, 
based on the buffer size. If zero-copy is not used, the mes- 
sage is sent through the shared buffer as discussed before. If 
zero-copy is used, the function registers the user buffer, con- 
structs a special packet that contains information about the 
user buffer such as address, size, and remote key, then sends 
the special packet by using RDMA write through the shared 
buffer. The put function returns a value of 0 at this stage
because no data has been transferred yet. Subsequent calls
to put also return 0 until all of the data has been transferred
and the operation has completed. Once the operation has
completed, put will return the number of bytes transferred.

When the packet arrives at the other side and the get 
function is called, the receiver checks the shared buffer and 
processes all the packets in order. If a packet is a data 
packet, the data is copied to the user buffer. If it is a spe- 
cial packet, the user buffer is registered, and an RDMA read 
operation is issued to fetch the data from the remote side di- 
rectly to the user buffer. After initiating the RDMA read, 
the get function returns with a value of 0, because the op- 
eration is still in progress. When the RDMA read is fin- 
ished, calling the get function leads to an acknowledgment 
packet being sent to the sender. The get function then re- 
turns the number of bytes successfully transferred. When 
the acknowledgment packet is received at the sender side, 
the sender deregisters the user buffer, completing the oper- 
ation, and the next call to the put function will return the 
number of bytes transferred. The zero-copy process is illustrated
in Figure 10.
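
The return-value protocol of the zero-copy path can be followed in a small state-machine model. This sketch is an illustration under our own assumptions: Sender, Receiver, on_ack, and complete_read are hypothetical names, a Python list stands in for the shared buffer that carries the special packet, and complete_read stands in for the HCA finishing the RDMA read.

```python
class Sender:
    """Sender-side state for one large message."""

    def __init__(self, user_buffer):
        self.user_buffer = user_buffer
        self.state = "new"

    def put(self, channel):
        if self.state == "new":
            # Register the buffer and send a special packet that describes it.
            channel.append(("special", self.user_buffer))
            self.state = "waiting_ack"
            return 0
        if self.state == "waiting_ack":
            return 0                      # operation still in progress
        return len(self.user_buffer)      # state == "done"

    def on_ack(self):
        # Deregister the buffer; the next put reports completion.
        self.state = "done"


class Receiver:
    """Receiver-side state for one large message."""

    def __init__(self, user_buffer):
        self.user_buffer = user_buffer
        self.state = "idle"
        self.remote = None

    def get(self, channel, sender):
        if self.state == "idle" and channel:
            _kind, self.remote = channel.pop(0)
            self.state = "reading"        # RDMA read has been issued
            return 0
        if self.state == "read_done":
            sender.on_ack()               # acknowledgment packet
            self.state = "idle"
            return len(self.remote)
        return 0

    def complete_read(self):
        """Hypothetical stand-in for the HCA finishing the RDMA read."""
        n = len(self.remote)
        self.user_buffer[:n] = self.remote
        self.state = "read_done"
```

Both functions return 0 while the transfer is in flight, so the layers above the RDMA Channel simply retry, exactly as they do for partial transfers through the shared buffer.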

In the current InfiniBand implementation, memory reg- 
istration and deregistration are expensive operations. To 
reduce the number of registrations and deregistrations, we 
have implemented a registration cache [24]. The basic idea
is to delay the deregistration of user buffers and put them 
into a cache. If the same buffer is reused later, its regis- 
tration information can be fetched directly from the cache 
instead of going through the expensive registration process. 
Deregistration happens only when there are too many regis- 
tered user buffers. 
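
A registration cache of this kind can be sketched as follows. The eviction policy shown here (least recently used, with a fixed capacity) is our assumption for illustration, since the text only says that deregistration happens when there are too many registered buffers; the register and deregister callables are supplied by the caller and stand in for the expensive InfiniBand operations.

```python
from collections import OrderedDict


class RegistrationCache:
    """Delays deregistration so that reused buffers skip registration."""

    def __init__(self, capacity, register, deregister):
        self.capacity = capacity
        self.register = register        # expensive call, supplied by the caller
        self.deregister = deregister
        self.entries = OrderedDict()    # (address, length) -> registration handle

    def acquire(self, address, length):
        """Return a registration handle, registering only on a cache miss."""
        key = (address, length)
        if key in self.entries:
            self.entries.move_to_end(key)   # cache hit: no registration needed
            return self.entries[key]
        handle = self.register(address, length)
        self.entries[key] = handle
        while len(self.entries) > self.capacity:
            _old_key, old_handle = self.entries.popitem(last=False)
            self.deregister(old_handle)     # evict the least recently used entry
        return handle
```

A production cache would also have to avoid evicting buffers with transfers in flight and handle overlapping address ranges; both are omitted here.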

We note that the effectiveness of registration cache de- 
pends on buffer reuse patterns of applications. If applica- 
tions rarely reuse buffers for communication, registration 
overhead cannot be avoided most of the time. Fortunately,
our previous study with the NAS Parallel Benchmarks [15]
has demonstrated that buffer reuse rates are very high in
these applications.

Figure 10. Zero-Copy Design

We compare the bandwidth of the pipelining design 
and the zero-copy design in Figure 11. We observe that
zero-copy greatly improves the bandwidth for large mes- 
sages. We achieve a peak bandwidth of 857 MB/s, which 
is quite close to the peak bandwidth at the InfiniBand level 
(870 MB/s). We also see that as a result of cache effect, 
bandwidth for large messages drops for the pipelining de- 
sign. Because of the extra overhead in the implementation, 
the zero-copy design slightly increases the latency for small 
messages, which is now around 7.6 µs.

Figure 11. MPI Bandwidth with Zero-Copy and
Pipelining



Our zero-copy implementation uses RDMA read oper- 
ations, which let the receiver "pull" data from the sender. 
An alternative is to use RDMA write operations and let
the sender "push" data to the receiver. Before the sender 
can push the data, the receiver has to use special packets 
to advertise availability of new receive buffers. Therefore, 
this method can be very efficient if the get operations are 
called before the corresponding put operations. In the cur- 
rent MPICH2 implementation, however, the layers above 
the RDMA Channel interface are implemented in such a 
way that get is always called after put for large messages. 
Therefore, we have chosen an RDMA read-based imple- 
mentation instead of RDMA write. 

6 Comparing CH3 and RDMA Channel In- 
terface Designs 

The RDMA Channel interface in MPICH2 provides a 
simple way to implement MPICH2 in many communication 
architectures. In the preceding section, we showed that this 
interface does not prevent one from achieving good perfor- 
mance. Nor does it prevent zero-copy implementation for 
large messages. Our results showed that with various optimizations,
we can achieve a latency of 7.6 µs and a peak
bandwidth of 857 MB/s. 

The CH3 interface is more complicated than the RDMA 
Channel interface. Therefore, porting it requires more ef- 
fort. However, since CH3 provides more flexibility, it is 
possible to achieve better performance at this level. 

To study the impact of different interfaces on MPICH2
performance, we have also done a CH3-level implementation.
This implementation uses RDMA write operations for
transferring large messages, as shown in Figure 12. Before
transferring the message, a handshake happens between the
sender and the receiver. The user buffer at the receiver is
registered, and its information is sent to the sender through
the handshake. The sender then uses RDMA write to transfer
the data. A registration cache is also used in this
implementation.
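
The idea behind a registration cache can be sketched as follows.
The text above states only that such a cache is used; the
exact-match lookup key, the LRU eviction policy, and all names
(RegistrationCache, lookup_or_register, _pin, _unpin) are
assumptions made for illustration, and the pin and unpin steps are
simulated with counters rather than real InfiniBand calls.

```python
from collections import OrderedDict


class RegistrationCache:
    """Toy cache of registered (pinned) buffers, keyed by (addr, length)."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.entries = OrderedDict()  # (addr, length) -> handle, oldest first
        self.pin_calls = 0            # counts simulated registrations
        self.unpin_calls = 0          # counts simulated deregistrations

    def _pin(self, addr, length):
        # Stand-in for the expensive memory registration step.
        self.pin_calls += 1
        return ("handle", addr, length)

    def _unpin(self, handle):
        # Stand-in for deregistration of an evicted buffer.
        self.unpin_calls += 1

    def lookup_or_register(self, addr, length):
        key = (addr, length)
        if key in self.entries:
            # Hit: reuse the existing registration and mark it recent.
            self.entries.move_to_end(key)
            return self.entries[key]
        # Miss: register the buffer, then evict the least recently used
        # entry if the cache is over capacity.
        handle = self._pin(addr, length)
        self.entries[key] = handle
        if len(self.entries) > self.capacity:
            _, victim = self.entries.popitem(last=False)
            self._unpin(victim)
        return handle
```

An application that repeatedly sends from the same user buffer pays
the registration cost once under this scheme; later transfers find
the handle in the cache.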

Figures 13 and 14 compare this implementation with our
RDMA Channel-based zero-copy design using latency and
bandwidth microbenchmarks. We see that the two
implementations perform comparably for small and large
messages. However, the CH3-based design outperforms the
RDMA Channel-based design for mid-sized messages (32K
to 256K Bytes) in bandwidth.

Figure 15 shows the bandwidth of RDMA read and
RDMA write at the InfiniBand VAPI level. (VAPI is the
programming interface for our InfiniBand cards.) With
the current VAPI implementation, RDMA write operations
have a clear advantage over RDMA read for mid-sized
messages. Therefore, the fact that the CH3-based design
outperforms the RDMA Channel-based design for mid-sized
messages is more the result of the raw performance
difference between RDMA write and RDMA read than of the
designs themselves.

Figure 12. CH3 Zero-Copy with RDMA Write

Figure 13. MPI Latency for CH3 Design and
RDMA Channel Interface Design



7 Application-Level Evaluation 

In this section, we carry out an application-level
evaluation of our MPICH2 designs using the NAS Parallel
Benchmarks [25]. We run class A benchmarks on 4 nodes and
class B benchmarks on 8 nodes. Benchmarks SP and BT 
require a square number of nodes. Therefore, their results 
are only shown for 4 nodes. 
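
The node-count constraint can be checked with a few lines of code.
This is a minimal sketch that models only the constraint stated
above (SP and BT need a square number of nodes); the function names
are hypothetical, and any other requirements of the benchmarks are
not modeled.

```python
import math

# Benchmarks that, per the text above, need a square number of nodes.
SQUARE_ONLY = {"SP", "BT"}


def is_square(n):
    """Return True if n is a perfect square."""
    root = math.isqrt(n)
    return root * root == n


def can_run(benchmark, nodes):
    """Return True if the benchmark's node-count constraint is met."""
    return benchmark not in SQUARE_ONLY or is_square(nodes)
```

With this check, SP and BT qualify for the 4-node runs but not for
the 8-node runs, which matches the results reported here.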

The results are shown in Figures 16 and 17. We have
evaluated three designs: RDMA Channel implementation
with pipelining for large messages (Pipelining), RDMA
Channel implementation with zero-copy for large messages
(RDMA Channel), and CH3 implementation with zero-copy
(CH3). Although the performance differences among these
three designs are small, we observe that the pipelining
design performs the worst in all cases. The RDMA
Channel-based zero-copy design performs very close to the
CH3-based zero-copy design. On average, the CH3-based design
performs less than 1% better on both 4 nodes and 8 nodes.

Figure 14. MPI Bandwidth for CH3 Design and
RDMA Channel Interface Design

Figure 16. NAS Class A on 4 Nodes

Figure 17. NAS Class B on 8 Nodes



8 Related Work 

As the predecessor of MPICH2 and one of the most popular
MPI implementations, MPICH has an implementation
structure similar to that of MPICH2. MPICH provides ADI2
(the second generation of the Abstract Device Interface) and
the Channel interface. Various implementations exist based on
these interfaces [?]. Our MVAPICH implementation [16],
which exploits RDMA write in InfiniBand, is
based on the ADI2 interface.

Since MPICH2 is relatively new, there exists very little
work describing its implementations on different
architectures. In [?], a CH3-level implementation based on
TCP/IP is described. Work in [22] presents an implementation
of MPICH2 over InfiniBand, also using the CH3 interface.
However, in our paper, our focus is on the RDMA Channel
interface instead of the CH3 interface. MPICH2 is designed
to support both MPI-1 and MPI-2 standards. There
have been studies about supporting the MPI-2 standard,
especially one-sided communication operations [5] [10]. To
date, we have concentrated on supporting MPI-1 functions
in MPICH2. We plan to explore the support of MPI-2
functions in the future.

Because of its high bandwidth and low latency, the In- 
finiBand Architecture has been used as the communication 
subsystem in a number of systems other than MPI, such 
as distributed shared-memory systems and parallel file
systems [?].

The RDMA Channel interface presents a stream-based
abstraction somewhat similar to the traditional socket
interface. There have been studies of how to implement a
user-level socket interface efficiently over high-speed
interconnects such as Myrinet, VIA, and Gigabit Ethernet
[19] [12] [4]. Recently, the Sockets Direct Protocol (SDP) [5]
has been proposed, which provides a socket interface over
InfiniBand. The idea of our zero-copy scheme is similar to the
Z-Copy scheme in SDP. However, there are also differences
between the RDMA Channel interface and the traditional
socket interface. For example, put and get functions in the
RDMA Channel interface are nonblocking, while functions
in the traditional socket interface are usually blocking. To
support the traditional socket interface, one has to make
sure the same semantics are maintained. We do not have to
deal with this issue for the RDMA Channel interface.
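
The nonblocking, stream-based behavior mentioned above can be
sketched with a bounded in-memory byte stream. This is an
illustration under assumptions, not the MPICH2 interface itself:
the class StreamChannel is hypothetical, and the convention that
put returns the number of bytes it accepted is assumed here rather
than taken from this section.

```python
class StreamChannel:
    """Toy byte stream with nonblocking put and get."""

    def __init__(self, capacity):
        self.capacity = capacity  # maximum number of buffered bytes
        self.buf = bytearray()

    def put(self, data):
        """Accept as many bytes as fit right now; never wait.

        Returns the number of bytes accepted, which may be 0.
        """
        room = self.capacity - len(self.buf)
        accepted = data[:room]
        self.buf += accepted
        return len(accepted)

    def get(self, max_bytes):
        """Return up to max_bytes of buffered data; never wait.

        Returns an empty bytes object if nothing is available.
        """
        out = bytes(self.buf[:max_bytes])
        del self.buf[:max_bytes]
        return out
```

A caller that gets back fewer bytes than it asked for simply retries
later with the remainder, whereas a blocking socket call would
suspend the caller until the operation could proceed.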

9 Conclusions and Future Work 

In this paper, we present a study of using RDMA oper- 
ations to implement MPICH2 over InfiniBand. Our work 
takes advantage of the RDMA Channel interface provided 
by MPICH2. 

The RDMA Channel interface provides a very small set 
of functions to encapsulate the underlying communication 
layer on which the whole MPICH2 implementation is built. 
Consisting of only five functions, the RDMA Channel in- 
terface is easy to implement for different communication 
architectures. However, the question arises whether this ab- 
straction is powerful enough that one can still achieve good 
performance. 

Our study has shown that the RDMA Channel interface
still gives implementors considerable flexibility. With
optimizations such as piggybacking, pipelining, and zero-copy,
MPICH2 is able to deliver good performance to the
application layer. For example, one of our designs achieves
7.6 μs latency and 857 MB/s peak bandwidth, which come quite
close to the raw performance of InfiniBand. In our study,
we characterize the impact of each optimization by using
latency and bandwidth microbenchmarks. We have also
conducted an application-level evaluation using the NAS
Parallel Benchmarks.

So far, our study has been restricted to a fairly small plat- 
form consisting of 8 nodes. In the future, we plan to use 
larger clusters to study various aspects of our designs re- 
garding scalability. Another direction we are pursuing is to 
provide support for MPI-2 functionalities such as one-sided 
communication using RDMA and atomic operations in In- 
finiBand. We are also working on how to support efficient 
collective communication on top of InfiniBand. 